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Abstract 

Graphical models such as Bayesian networks have been used successfully to 
capture dependencies between entities of interest from observational data sets. 
Graphical abstractions can also provide system-level insights, hence of great 
interest across a spectrum of disciplines. Identifying significant edges in the 
resulting graph have traditionally relied on the choice of an ad-hoc threshold, 
which can have a pronounced impact on the network topology and conclusions. 
In the present study, a statistically-motivated approach that obviates the need 
for an ad-hoc threshold is proposed for identifying significant edges. Several 
aspects of the proposed approach are investigated across synthetic as well as 
experimental data sets. 
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1. Introduction and Background 

Graphical models [3, H| are a class of statistical models which combine the 
rigour of a probabilistic approach with the intuitive representation of relation- 
ships given by graphs. They are composed by a set X = {Xi,X 2 , ■ ■ ■ ,Xn} of 
random variables describing the quantities of interest and a graph Q = (V, E) in 
which each node or vertex 11 £ Vis associated with one of the random variables 
in X (they are usually referred to interchangeably). The edges e £ E are used 
to express the dependence relationships among the variables in X. The set of 
these relationships is often referred to as the dependence structure of the graph. 
Different classes of graphs express these relationships with different semantics, 
which have in common the principle that graphical separation of two vertices 
implies the conditional independence of the corresponding random variables Q • 
The two examples most commonly found in literature are Markov networks 
[H, 13], which use undirected graphs, and Bayesian networks 0, @, which use 
directed acyclic graphs. 
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In principle, there are many possible choices for the joint distribution of 
X, depending on the nature of the data. However, literature have focused 
mostly on two cases: the discrete case [3|, [7[, in which both X and the Xi are 
multinomial random variables, and the continuous case [H, HI, in which X is 
multivariate normal and the Xi are univariate normal random variables. In the 
former, the parameters of interest are the conditional probabilities associated 
with each variable, usually represented as conditional probability tables; in the 
latter, the parameters of interest are the partial correlation coefficients between 
each variable and its neighbours (i.e. the adjacent nodes in Q). 

The estimation of the structure of the graph Q is called structure learning 
P, IH, and involves determining the graph structure that encodes the condi- 
tional independencies present in the data. Ideally it should coincide with the 
dependence structure of X, or it should at least identify a distribution as close 
as possible to the correct one in the probability space. Several algorithms have 
been presented in literature for this problem, thanks to the application of many 
results from probability, information and optimisation theory. Despite differ- 
ences in theoretical backgrounds and terminology, they can all be grouped into 
only three classes: constraint-based algorithms, that are based on conditional 
independence tests; score-based algorithms, that are based on goodness-of-fit 
scores; and hybrid algorithms, that combine the previous two approaches. For 
some examples see Bromberg et al. Castelo and Roverato [n}, Friedman ct 



al. [11], Larrahaga et al. [12[ and Tsamardinos et al. [13| . 

On the other hand, the development of techniques for assessing the statistical 
robustness of network structures learned from data (e.g. the presence of artefacts 
arising from noisy data) has been limited. Structure learning algorithms are 
commonly studied measuring differences from the true (known) structure of a 



small number of reference data sets [14|, [15| . The usefulness of such an approach 



in investigating networks learned from real-world data sets is limited, since the 
true structure of their probability distribution is unknown. 

A more systematic approach to model assessment, and in particular to the 
problem of identifying statistically significant features in a network, has been 
developed by Friedman et al. [l6j using bootstrap resampling [l7| and model 
averaging [l8|. It can be summarised as follows: 



For b = 1,2, ... ,m: 

(a) sample a new data set X£ from the original data X using either 
parametric or nonparametric bootstrap; 

(b) learn the structure of the graphical model Gb = (V, E b ) from XJ. 
Estimate the probability that each possible edge e,, i = 1, . . . , k is present 
in the true network structure Go = (V, E ) as 

b=l 

where ~5Li ei £E b } is the indicator function of the event {e^ S E b } (i.e., it is 
equal to 1 if G Eb and otherwise) . 
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The empirical probabilities P(e%) are known as edge intensities or arc strengths, 
and can be interpreted as the degree of confidence that e 2 ; is present in the net- 
work structure Go describing the true dependence structure of However, 
they are difficult to evaluate, because the probability distribution of the net- 
works Qb in the space of the network structures is unknown. As a result, the 
value of the confidence threshold (i.e. the minimum degree of confidence for an 
edge to be significant and therefore accepted as an edge of Qo) is an unknown 
function of both the data and the structure learning algorithm. This is a se- 
rious limitation in the identification of significant edges and has led to the use 
of ad-hoc, pre-defined thresholds in spite of the impact on model assessment 
evidenced by several studies An exception is Nagarajan et al. (20j . 

whose approach will be discussed below. 

Apart from this limitation, Friedman's approach is very general and can be 
used in a wide range of settings. First of all, it can be applied to any kind of 
graphical model with only minor adjustments (for example, accounting for the 
direction of the edges in Bayesian networks, see Sec. U]). No distributional as- 
sumption on the data is required in addition to the ones needed by the structure 
learning algorithm. No assumption is made on the latter, either, so any score- 
based, constraint-based or hybrid algorithm can be used. Furthermore, parallel 
computing can easily be used to offset the additional computational complexity 
introduced by model averaging, because bootstrap is embarrassingly parallel. 

In this paper, we propose a statistically-motivated estimator for the confi- 
dence threshold minimising the L\ norm between the cumulative distribution 
function of the observed confidence levels and the cumulative distribution func- 
tion of the confidence levels of the unknown network Qo- Subsequently, we 
demonstrate the effectiveness of the proposed approach by re-investigating two 
experimental data sets from Nagarajan et al. [2Cf and Sachs et al. [211 ]. 



2. Selecting Significant Edges 

Consider the empirical probabilities P(ei) defined in Eq. [T] and denote 
them with p = {pi, i — 1, . . . , k}. For a graph with N nodes, k = N(N — l)/2. 
Furthermore, consider the order statistic 

P(-) = {P(i),P(2),---,P(k)) with < P(2) < • • • < P(fe) ( 2 ) 

derived from p. It is intuitively clear that the first elements of p(.) are more likely 
to be associated with non-significant edges, and that the last elements of p(.) 
are more likely to be associated with significant edges. The ideal configuration 
P(.) of p(.) would be 

P i = { 1 ^ E E ° (3) 
^ I otherwise 



1 The probabilities P(ei) are in fact an estimator of the expected value of the {0, 1} random 
vector describing the presence of each possible edge in Qo. As such, they do not sum to one 
and are dependent on one another in a nontrivial way. 
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that is the set of probabilities that characterises any edge as either significant 
or non-significant without any uncertainty. In other words, 

p (0 = {0,...,0,1,...,1}. (4) 

Such a configuration arises from the limit case in which all the networks Qb 
have exactly the same structure. This may happen in practise with a consistent 
structure learning algorithm when the sample size is large [22|, |23( . 

A useful characterisation of p>(.) and pV.) can be obtained through the em- 
pirical cumulative distribution functions of the respective elements, 

1 k 

F p()( x ) = %(«<«> ( 5 ) 

t=l 

and 

!0 if x G (-oo,0) 
t if a; G [0,1) . (6) 
1 if x G [l,+oo) 

In particular, t corresponds to the fraction of elements of p>(.) equal to zero 
and is a measure of the fraction of non-significant edges. At the same time, t 
provides a threshold for separating the elements of p>(.), namely 

e (i) EE *=*p (i) >F p{ 1 ) (t) (7) 

where (t) = inf^gR {Fp ( } (x) ^ i} is the quantile function [24j |. 

More importantly, estimating t from data provides a statistically motivated 
threshold for separating significant edges from non-significant ones. In prac- 
tise, this amounts to approximating the ideal, asymptotic empirical cumulative 
distribution function Fp ( } with its finite sample estimate Fp { ) . Such an approx- 
imation can be computed in many different ways, depending on the norm used 
to measure the distance between . and Fp { } as a function of t. Common 
choices are the L p family of norms [25[ , which includes the Euclidean norm, and 
Csiszar's /-divergences ^26], which include Kullback-Leibler divergence. 

The L\ norm 

L 1 (t; P( . ) )= J \F Pn (x)~F H) {x;t)\dx (8) 

appears to be particularly suited to this problem; an example is shown in Fig. 
[TJ First of all, note that Fp { ) is piecewise constant, changing value only at the 
points |5(j) ; this descends from the definition of empirical cumulative distribution 
function. Therefore, for the problem at hand Eq. [8] simplifies to 

£i(*;P(.))= ^2 \Fp 0) (xi) -t\(x i+1 -Xi), (9) 

^ 1 6{{0}up(. ) u{i}} 
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Figure 1: The empirical cumulative distribution function Fp^ (left), the cumulative distri- 
bution function Fp (centre) and the L\ norm between the two (right), shaded in grey. 



which can be computed in linear time from fir.) . Its minimisation is also straight- 
forward using linear programming [271 ] . Furthermore, compared to the more 
common norm 

L 2 (i; P( .)) = j [F H) (x) - F P(0 (as; t)] 2 dx (10) 

or the Loo norm 

£oo(*;P(.))= ™?xA\ F P()( x ) --^Pf.)^*)!}' ( n ) 
xe[o,i] 

the Li norm does not place as much weight on large deviations compared to 
small ones, making it robust against a wide variety of configurations of ftr.y 

Then the identification of significant edges can be thought of either as a least 
absolute deviations estimation or an ii approximation of the form 

f = argmin L\ (t; p ( .)) (12) 

te[o,i] 

followed by the application of the following rule: 

e w ei?o^P W >L- ( 1 > (£). (13) 

Note that, even though edges are individually identified as as significant or non- 
significant, they are not identified independently of each other because t is a 
function of the whole p>(.)- 

A simple example is illustrated below. 

Example 1. Consider a graphical model based on an undirected graph Q with 
node set V = {A, B, C, D}. The set of possible edges of Q contains 6 elements: 



5 




t i i i i r 

0.0 0.2 0.4 0.6 0.8 1.0 



X 



t 



Figure 2: The cumulative distribution functions F p ^ . and Fp^ , (t), respectively in black and 
grey (left), and the L\ (i;p+j) norm (right) from Example[T] 

(A, B), (A,C), (A, D), (B,C), (B, D) and (C,D). Suppose that that we have 
estimated the following confidence values: 



Pab = 0.2242, 
psc = 0.3921, 



Vac = 0.0460, 
Pbd = 0.7689, 



Pad = 0.8935, 
Pcd = 0.9439. 



(14) 



Then p ( .) = {0.0460,0.2242,0.3921,0.7689,0.8935,0.9439} and 

fO if x e (-co, 0.0460) 
1 



F Pi ,(x) 



= < 



if x G [0.0460,0.2242) 
if x G [0.2242,0.3921) 
if x G [0.3921,0.7689) 
if x G [0.7689,0.8935) 
if x G [0.8935,0.9439) 



(15) 



U if x £ [0.9439, +oo) 
The L\ norm takes the form 



Li (*;P(.)) = |0-t| (0.0460-0) 
2 



t (0.3921 - 0.2242) 
t (0.8935- 0.7689) 



(0.2242 - 0.0460)+ 

-t (0.7689 - 0.3921)+ 

t (0.9439 - 0.8935)+ 

|1 -t\ (1 -0.9439) (16) 
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and is minimised for i — 0.4999816. Therefore, an edge is deemed significant if 
its confidence is strictly greater than Fp (0.4999816) = 0.3921, or, equivalently, 
if it has confidence of at least 0.7689; only (A, D), {B,D) and (C,D) satisfy 
this condition. 



3. Simulation Results 

We tested the proposed approach on synthetic data sets using three estab- 
lished performance measures: sensitivity, specificity and accuracy. Sensitivity 
is given by the proportion of edges of the true network structure that have been 
correctly identified as significant. Specificity is given by the proportion of the 
edges missing from the true network structure that have been correctly identi- 
fied as non-significant. Accuracy is given by the proportion of edges correctly 
identified as either significant or non-significant over the set of all possible edges. 
To that end, we generated 400 data sets of varying sizes (100, 200, 500, 1000, 
2000, 5000, 10000 and 20000) from three discrete Bayesian networks commonly 
used as benchmarks: 

• the ALARM network [28j], a network designed to provide an alarm mes- 
sage system for intensive care unit patient monitoring. Its true structure 
is composed by 37 nodes and 46 edges (of 666 possible edges), and its 
probability distribution has 509 parameters; 

• the HAILFINDER network 29], a network designed to forecast severe 
summer hail in northeastern Colorado. Its true structure is composed 
by 56 nodes and 66 edges (of 1540 possible edges), and its probability 
distribution has 2656 parameters; 

• the INSURANCE network [30[ , a network designed to evaluate car insur- 
ance risks. Its true structure is composed by 27 nodes and 52 edges (of 
351 possible edges), and its probability distribution has 984 parameters. 

Three different structure learning algorithms were considered: 

• the Incremental Association Markov Blanket (IAMB) constraint-based al- 



gorithm [31| . IAMB was used to learn the Markov blanket of each node as 
a preliminary step to reduce the number of its candidate parents and chil- 
dren; a network structure satisfying these constraints is then identified as 
in the Crow-Shrink algorithm [32J. Conditional independence tests were 
performed using a shrinkage mutual information test (33| with a ~ 0.05. 
Such a test, unlike the more common asymptotic x 2 mutual information 
test, is valid and has been shown to work reliably even on small samples. 
An a = 0.01 was also considered; however, the results were not signifi- 
cantly different from a = 0.05 and will not be discussed separately in this 
paper; 

the Hill Climbing (HC) score-based algorithm with the Bayesian Dirichlet 
equivalent uniform (BDeu) score function, the posterior distribution of 
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the network structure arising from a uniform prior distribution 7]. The 
equivalent sample size was set to 10. This is the same approach detailed 
in Friedman et al. [l6|, although they considered only 100 (instead of 500) 
bootstrap samples for each scenario; 

• the Max-Min Hill Climbing (MMHC) hybrid algorithm [13], which com- 
bines the Max-Min Parents and Children (MMPC) and HC. The condi- 
tional independence test used in MMPC and the score functions used in 
HC are the ones illustrated in the previous points. 

The performance measures were estimated for each combination of network, 
sample size and structure learning algorithm as follows: 

1. a sample of the appropriate size was generated from either the ALARM, 
the HAILFINDER or the INSURANCE network; 

2. we estimated the confidence values p for all possible edges from 200 and 
500 nonparametric bootstrap samples. Since results are very similar, they 
will be discussed together; 

3. we estimated the confidence threshold t, and identified significant and 
non-significant edges in the network. Note that the direction of the edges 
present in the network structure is effectively ignored, because the pro- 
posed approach focuses only those edges' presence. Significant edges were 
then used to build an averaged network structure; 

4. we computed sensitivity, specificity and accuracy comparing the averaged 
network structure to the true one, which is known from literature. 

These steps were repeated 50 times in order to estimate both the performance 
measures and their variability. 

All the simulations and the thresholds estimation were performed with the 
bnlearn package (34]j HH for R [36| , which implements several methods for struc- 
ture learning, parameter estimation and inference on Bayesian networks (includ- 
ing the approach proposed in Sec. [J). 

The average values of sensitivity, specificity, accuracy and t for the networks 
across various sample sizes (n) are shown in Fig. [3] (IAMB), Fig. [4] (HC) and 
Fig. [5] (MMHC). Since the number of parameters is non-constant across the 
networks, a normalised ratio of the size of the generated sample to the number 
of parameters of the network (i.e. n/p) is used as a reference instead of the 
raw sample size (i.e. n). Intuitively, a sample of size of n = 1000 may be large 
enough to estimate reliably a small network with few parameters, say p = 100, 
but it may be too small for a larger network with p = 10000. On a related note, 
denser networks (i.e. networks with a large number of edges compared to the 
number of nodes) usually have a higher number of parameters than sparser ones 
(i.e. networks with few edges). 

Several interesting trends emerge from the estimated quantities. As ex- 
pected, sensitivity increases as the sample size grows. This provides an empirical 
verification that the combination of HC and BDe is indeed consistent, as proved 
by Chickering (23|. No analogous result exists for IAMB or MMHC, although 
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Figure 3: Average sensitivity, specificity and accuracy of IAMB for the ALARM, HAIL- 
FINDER and INSURANCE networks over n/p. Bars represent 95% confidence intervals, and 
the dotted vertical line is n = p. 
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Figure 4: Average sensitivity, specificity and accuracy of HC for the ALARM, HAILFINDER 
and INSURANCE networks over n/p. Bars represent 95% confidence intervals, and the dotted 
vertical line is n = p. 
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Figure 5: Average sensitivity, specificity and accuracy of MMHC for the ALARM, HAIL- 
FINDER, and INSURANCE networks over n/p. Bars represent 95% confidence intervals, and 
the dotted vertical line is n = p. 



intuitively their sensitivity should improve as well with the sample size due to 
the consistency of the conditional independence tests used by those algorithms. 
Moreover, even when n/p is extremely low a substantial proportion of the net- 
work structure can be correctly identified. When n/p is at least 0.2 (i.e. 1 
observation every 5 parameters), HC successfully recovers from about 50% (for 
ALARM and INSURANCE) to 75% (for HAILFINDER) of the true network 
structure. In contrast, IAMB and MMHC successfully recover from about 45% 
to 50% of HAILFINDER, but only about 26% to 40% of ALARM and 19% to 
30% of INSURANCE. This difference in performance can be attributed to the 
sparsity-inducing effect of shrinkage tests [33|, which increase specificity at the 
cost of sensitivity. For values of n/p greater than 1 (i.e. more observations 
than parameters) the increase in sensitivity slows down for all combinations of 
networks and algorithms, reaching a plateau. 

Overall, sensitivity seems to have an hyperbolic behaviour, growing very 
rapidly for n/p 1 and then converging asymptotically to 1 for n/p > 1. Thus 
we expect it to increase linearly on a \og(n/p) scale. The slower convergence rate 
observed for the INSURANCE network compared to the other two networks is 
likely to be a consequence of its high edge density (1.92 edges per node) relative 
to ALARM (1.24) and HAILFINDER (1.17). Slower convergence may also be 
an outcome of inherent limitations of structure learning algorithms in the case 
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Figure 6: Average estimated significance threshold (t) for the ALARM, HAILFINDER and 
INSURANCE networks over n/p. Bars represent 95% confidence intervals. 



of dense networks 0, HH . 

Furthermore, both specificity and accuracy are close to 1 for all the networks 
and the sample sizes considered in the analysis, even at very low n/p ratios. Such 
high values are a result of the low number of true edges in ALARM, HAIL- 
FINDER and INSURANCE compared to the respective numbers of possible 
edges. This is true in particular for the ALARM and HAILFINDER networks. 
The lower values observed for the INSURANCE network can be attributed 
again to the inherent limitations of structure learning algorithms in modelling 
dense networks. The sparsity-inducing effect of shrinkage tests is again evident 
for both IAMB and MMHC; both specificity and accuracy actually decrease 
slightly as n/p grows and the influence of shrinkage decreases. 

It is also important to note that, as shown in Fig. [6j the average value of 
the confidence threshold t does not exhibit any apparent trend as a function 
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of n/p. In addition, its variability does not appear to decrease as n/p grows. 
This suggests that the optimal t depends strongly on the specific sample used 
in the estimation of the confidence values p, even for relatively large samples. 
However, specificity, sensitivity and accuracy estimates appear on the other 
hand to be very stable (all confidence intervals shown in Fig. [31 Fig. @]and Fig. 
03 are very small) . 

From Fig. [6l it is also apparent that the threshold estimate t can be signifi- 
cantly lower than 1 even for high values of n/p. This behaviour is observed con- 
sistently across the three networks (ALARM, HAILFINDER, INSURANCE). 
These results are in sharp contrast with ad-hoc thresholds commonly found 
in literature, which are usually large [e.g. 0.8 in 16]. A large threshold can 
certainly be useful in excluding noisy edges, which may result from artefacts 
at the measurement and dynamical levels and from finite sample-size effects. 
However, while a large ad-hoc threshold can certainly minimise false positives, 
it is also expected to accentuate false negatives. Such a conservative choice 
can have a profound impact on the network topology, resulting in artificially 
sparse networks. The threshold estimator introduced in Sec. [5] achieves a good 
trade-off between incorrectly identifying noisy edges as significant and disre- 
garding significant ones. As an example, the difference in sensitivity, specificity 
and accuracy between the estimated threshold i and several large, ad-hoc ones 
(t = 0.70, 0.80, 0.90, 0.95) for HC is shown in Fig. (the corresponding plots 
for IAMB and MMHC are similar, and are omitted for brevity). The threshold 
t systematically outperforms the ad-hoc thresholds in terms of sensitivity, in 
particular for low values of n/p. The difference progressively vanishes as n/p 
grows. All thresholds have comparable levels of specificity and accuracy. 

On a related note, false negatives across ad-hoc thresholds may also be at- 
tributed to the fact that edges are considered as separate, independent entities 
as far the choice of the threshold is concerned - i.e. a 0.99 threshold is expected 
to identify as significant about 1 in 100 edges in the network. However, in a 
biological setting the structure of the network is an abstraction for the under- 
lying functional mechanisms; as an example, consider the signalling pathways 
in a transcriptional network. In such a context, edges are clearly not indepen- 
dent, but appear in concert along signalling pathways. This interdependence is 
accounted for in the proposed approach (that is based on the full set p of esti- 
mated condence values), but it is not commonly considered in choosing ad- hoc 
thresholds. For instance, edges appearing with individual confidence values far 
below the [0.80, 1] range may not necessarily be identified as significant by an 
ad-hoc threshold. However, the proposed approach recognises their interplay 
and correctly identifies them as significant. This aspect, along with the strong 
dependence between the optimal t and the actual sample the network is learned 
from, may discourage the use of an a priori or ad-hoc confidence threshold in 
favour of more statistically-motivated alternatives. 
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Figure 7: Difference in sensitivity, specificity and accuracy between the estimated threshold i 
and several ad-hoc ones (t = 0.70, 0.80, 0.90, 0.95) for HC over n/p. 



4. Applications to Molecular Expression Profiles 

In order to demonstrate the effectiveness of the proposed approach on exper- 
imental data sets, we will examine two gene expression data sets from Nagarajan 



et al. |20| and Sachs et al. [2l|. All the anal yses will be performed again with 
the bnlearn package. Following Imoto et al. [39j, we will consider the edges of 
the Bayesian networks disregarding their direction when determining their sig- 
nificance. Edges identified as significant will then be oriented according to the 
direction observed with the highest frequency in the bootstrapped networks Qb- 
While simplistic, this combined approach allows the proposed estimator to han- 
dle the edges whose direction cannot be determined by the structure learning 
algorithm possibly due to score equivalent structures [40[ . 
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Figure 8: The empirical cumulative distribution function Fp^ ^ for the myogenic progenitors 
data from Nagarajan et al. I '21 1 (on the left), and the network structure resulting from the 
selection of the significant edges (on the right). The vertical dashed line in the plot of _F p( 

represents the threshold F" 1 (t). 



4-1- Differentiation Potential of Aged Myogenic Progenitors 

In a recent study 20] the interplay between crucial myogenic (Myogenin, 
Myf-5, Myo-Dl), adipogenic (C/EBPa, DDIT3, FoxC2, PPAR7), and Wnt- 
related genes (Lrp5, Wnt5a) orchestrating aged myogenic progenitor differenti- 
ation was investigated by Nagarajan et al. using clonal gene expression profiles 
in conjunction with Bayesian network structure learning techniques. The ob- 
jective was to investigate possible functional relationships between these diverse 
differentiation programs reflected by the edges in the resulting networks. The 
clonal expression profiles were generated from RNA isolated across 34 clones 
of myogenic progenitors obtained across 24-month-old mice and real-time RT- 
PCR was used to quantify the gene expression. Such an approach implicitly 
accommodates inherent uncertainty in gene expression profiles and justified the 
choice of probabilistic models. 

In the same study, the authors proposed a non-parametric resampling ap- 
proach to identify significant functional relationships. Starting from Friedman's 
definition of confidence levels (Eq. Q]), they computed the noise floor distribu- 
tion f = {fx, fa, ■ ■ . 1 fk} of the edges by randomly permuting the expression of 
each gene and performing Bayesian network structure learning on the resulting 
data sets. An edge was deemed significant if Pi > max^ g jj In addition 
to revealing several functional relationships documented in literature, the study 
also revealed new relationships that were immune to the choice of the structure 
learning techniques. These results were established across clonal expression data 
normalised using three different housekeeping genes and networks learned with 
three different structure learnin g al gorithms. 

The approach presented in [20( has two important limitations. First, the 
computational cost of generating the noise floor distribution may discourage its 
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application to large data sets. In fact, the generation of the required permu- 
tations of the data and the subsequent structure learning (in addition to the 
bootstrap resampling and the subsequent learning required for the estimation 
of p) essentially doubles the computational complexity of Friedman's approach. 
Second, a large sample size may result in an extremely low value of max(f), and 
therefore in a large number of false positives. 

In the present study, we re-investigate the myogenic progenitor clonal ex- 
pression data normalised using housekeeping gene GAPDH with the approach 
outlined in Sec. [5] and the IAMB algorithm. It is important to note that this 
strategy was also used in the original study [2(j, hence its choice. The order 
statistic p(.) was computed from 500 bootstrap samples. The empirical cumu- 
lative distribution function Fp,. , the estimated threshold and the network with 
the significant edges are shown in Fig. [8] 

All edges identified as significant in the earlier study [20j across the various 
structure learning techniques and normalisation techniques were also identified 
by the proposed approach (see Fig. 3D in [20l|). In contrast to Fig. [5J the original 
study using IAMB and normalisation with respect to GAPDH alone detected a 
considerable number of additional edges (see Fig. 3 A in [20j). Thus it is quite 
possible that the approach proposed in this paper reduces the number of false 
positives and spurious functional relationships between the genes. Furthermore, 
the application of the proposed approach in conjunction with the algorithm from 
Imoto et al. [3{| reveals directionality of the edges, in contrast to the undirected 
network reported by Nagarajan et al. pol ]. 

4-2. Protein Signalling in Flow Cytometry Data 

In a landmark study, Sachs et al. [2l[ used Bayesian networks for identifying 
causal influences in cellular signalling networks from simultaneous measurement 
of multiple phosphorylated proteins and phospholipids across single cells. The 
authors used a battery of perturbations in addition to the unperturbed data 
to arrive at the final network representation. A greedy search score-based algo- 
rithm that maximises the posterior probability of the network Q and accommo- 
dates for variations in the joint probability distribution across the unperturbed 



and perturbed data sets was used to identify the edges [41] . More importantly 



significant edges were selected using an arbitrary significance threshold of 0.85 



(see Fig. 3, 2l|). A detailed comparison between the learned network and func- 
tional relationships documented in literature was presented in the same study. 

We investigated the performance of the proposed approach in identifying 
significant functional relationships from the same experimental data. However, 
we limit ourselves to the data recorded without applying any molecular inter- 
vention, which amount to 854 observations for 11 variables. We compare and 
contrast our results to those obtained using an arbitrary threshold of 0.85. The 
combination of perturbed and non-perturbed observations studied in Sachs et 



al. [2l[ cannot be analysed with our approach, because each subset of the data 
follows a different probability distribution and therefore there is no single "true" 
network Qo ■ Analysis of the unperturbed data using the approach presented in 
Sec. [5] reveals the edges reported in the original study. The resulting network 



15 



threshold =0.93 



CO 

d 



10 
d 



d 



o 
d 




— I 1 1 1 1 1 — 

0.0 0.2 0.4 0.6 0.8 1.0 




Figure 9: The em piri cal cumulative distribution function of pr.\ for the flow cytometry data 
from Sachs et al. 21] (on the left), and the network structure resulting from the selection of 
the significant edges (on the right). The vertical dashed line in the plot of Fp ^ represents 

the threshold F- 1 (i). 



is shown in Fig. [9] along with Fp,. and the estimated threshold. From the 
plot of Fp. . we can clearly see that significant and non-significant edges present 
widely different levels of confidence, to the point that any threshold between 
0.4 and 0.9 results in the same network structure. This, along with the value 
of the estimated threshold (p^ 0.93), shows that the noisiness of the data 
relative to the sample size is low. In other words, the sample is big enough 
for the structure learning algorithm to reliably select the significant edges. The 
edges identified by the proposed method were the same as those identified by 



2l| using general stimulatory cues excluding the data with interventions (see 



Fig. 4A in [21|, Supplementary Information). In contrast to 21|, using Imoto 
et al. {ii} approach in conjunction with the proposed thresholding method we 
were able to identify the directions of the edges in the network. The directions 
correlated with the functional relationships documented in literature (Tab. 3, 



2l| , Supplementary Information) as well as with the directions of the ed ges in 



the network learned from both perturbed and unperturbed data (Fig. 3, |2l|). 



5. Conclusions 

Graphical models and network abstractions have enjoyed considerable at- 
tention across the biological and medical communities. Such abstractions are 
especially useful in deciphering the interactions between the entities of interest 
from high-throughput observational data. Classical techniques for identifying 
significant edges in the resulting graph rely on ad-hoc thresholding of the edge 
confidence estimated from across multiple independent realisations of networks 
learned from the given data. Large ad-hoc threshold values are particularly 
common, and are chosen in an effort to minimise noisy edges in the resulting 
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network. While useful in minimising false positives, such a choice can accen- 
tuate false negatives with pronounced effect on the network topology. The 
present study overcomes this caveat by proposing a more straightforward and 
statistically-motivated approach for identifying significant edges in a graphical 
model. The proposed estimator minimises the L\ norm between the cumula- 
tive distribution function of the observed confidence levels and the cumulative 
distribution function of their asymptotic, ideal configuration. The effectiveness 



of the proposed approach is demonstrated on three synthetic data sets [28143 



and on gene expression data sets across two different studies [20|,|2l|. However, 
the approach is defined in a more general setting and can be applied to many 
classes of graphical models learned from any kind of data. 
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